Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts

Neural Information Processing Systems

By transferring both features and gradients between different layers, the shortcut connections explored by ResNets allow us to effectively train very deep neural networks of up to hundreds of layers. However, the additional computation costs induced by those shortcuts are often overlooked. For example, during online inference, the shortcuts in ResNet-50 account for about 40 percent of the total memory usage on feature maps, because the features of the preceding layers cannot be released until the subsequent calculation is completed. In this work, for the first time, we consider training CNN models with shortcuts and deploying them without. In particular, we propose a novel joint-training framework that trains a plain CNN by leveraging the gradients of its ResNet counterpart.
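The core idea of the abstract above, training a shortcut-free network with help from a residual counterpart, can be illustrated with a deliberately tiny toy. The sketch below is hypothetical and scalar-valued, not the paper's actual algorithm: a "teacher" block with an identity shortcut (y = w*x + x) distills its output into a plain "student" block (y = v*x), so the student's weight absorbs the shortcut and the deployed model needs none. (The paper additionally transfers gradients between the two networks, which this toy omits.)

```python
# Hypothetical toy sketch: fold an identity shortcut into a plain weight
# via output distillation. Not the paper's implementation.

def train_student(w_teacher, xs, lr=0.05, steps=200):
    v = 0.0                                # student weight, no shortcut
    for _ in range(steps):
        for x in xs:
            y_t = w_teacher * x + x        # teacher: residual branch + shortcut
            y_s = v * x                    # student: plain branch only
            grad = 2.0 * (y_s - y_t) * x   # d/dv of (y_s - y_t)**2
            v -= lr * grad
    return v

v = train_student(w_teacher=0.5, xs=[1.0, 2.0, 3.0])
# The student weight converges toward w_teacher + 1: it has absorbed the
# identity shortcut into its own weight.
print(round(v, 3))  # prints 1.5
```

In this linear toy the shortcut can be folded away exactly; the interesting part of the actual paper is doing this for deep nonlinear blocks, where only the training signal, not an algebraic rewrite, can transfer the shortcut's effect.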


Review for NeurIPS paper: Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts

Neural Information Processing Systems

Weaknesses: * There are numerous approaches to reducing a ConvNet's memory footprint and computational cost at inference time, including but not limited to channel pruning, dynamic computation graphs, and model distillation. Why is removing shortcut connections the best way to achieve the same goal? The baselines considered in Tables 3 and 4 are rather lacking. For example, how does the proposed method compare to: 1. A pruning method that reduces ResNet-50's channel counts to match the memory footprint and FLOPs of the plain CNN-50? What would the drop in accuracy be?


Review for NeurIPS paper: Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts

Neural Information Processing Systems

The new training scheme follows the teacher-student paradigm to obtain results comparable to those of a ResNet model, but without residual connections (shortcuts). Results are on par with SOTA and the approach is very interesting, although not necessarily very novel in principle (I encourage the authors to make this much clearer in the final text). All reviewers agree that this is a good contribution and that the rebuttal was helpful in reaching the final conclusion.


Adversarial Learning of Portable Student Networks

Wang, Yunhe (Peking University) | Xu, Chang (University of Sydney) | Xu, Chao (Peking University) | Tao, Dacheng (University of Sydney)

AAAI Conferences

Effective methods for learning deep neural networks with fewer parameters are urgently required, since the storage and computation costs of heavy neural networks have largely prevented their widespread use on mobile devices. Compared with algorithms that directly remove weights or filters to obtain considerable compression and speed-up ratios, training thin deep networks via the student-teacher learning paradigm is more flexible. However, it is very hard to determine which formulation is optimal for measuring the information inherited from the teacher network. To overcome this challenge, we utilize a generative adversarial network (GAN) to learn the student network. In practice, the generator is exactly the student network, with far fewer parameters, and the discriminator is used as a teaching assistant that distinguishes features extracted from the student and teacher networks. By simultaneously optimizing the generator and the discriminator, the resulting student network can produce features of input data with a distribution similar to that of the teacher network's features. Extensive experimental results on benchmark datasets demonstrate that the proposed method is capable of learning well-performing portable networks, superior to state-of-the-art methods.
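The generator-as-student, discriminator-as-teaching-assistant setup described above can be sketched in one dimension. The following is a hypothetical toy, not the paper's implementation: the "student feature" is a single scalar m, the "teacher feature" is a constant, and a logistic discriminator D(x) = sigmoid(a*x + b) is trained to tell them apart while the student is trained to fool it, so m drifts toward the teacher's feature value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adversarial_distill(t_feat=2.0, steps=5000, lr=0.05):
    """Toy alternating-update loop (hypothetical, scalar features):
    the discriminator labels teacher features 1 and student features 0;
    the student (generator) tries to make its feature indistinguishable."""
    a, b, m = 1.0, 0.0, 0.0          # discriminator params a, b; student feature m
    for _ in range(steps):
        d_t = sigmoid(a * t_feat + b)            # D on the teacher feature
        d_s = sigmoid(a * m + b)                 # D on the student feature
        # Discriminator step: minimize -log D(t) - log(1 - D(s)).
        a -= lr * (-(1.0 - d_t) * t_feat + d_s * m)
        b -= lr * (-(1.0 - d_t) + d_s)
        # Generator (student) step: minimize -log D(s).
        m -= lr * (-(1.0 - sigmoid(a * m + b)) * a)
    return m

m = adversarial_distill()
print(m)  # drifts toward the teacher feature value, here about 2.0
```

The only stable operating point is where the discriminator cannot separate the two features, i.e. the student's feature distribution matches the teacher's; this is the same mechanism the abstract invokes, with real feature maps and deep networks in place of scalars.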